Identifying Mislabeled Training Data

Authors

  • Carla E. Brodley
  • Mark A. Friedl
Abstract

This paper presents a new approach to identifying and eliminating mislabeled training instances for supervised learning. The goal of this approach is to improve classification accuracies produced by learning algorithms by improving the quality of the training data. Our approach uses a set of learning algorithms to create classifiers that serve as noise filters for the training data. We evaluate single algorithm, majority vote and consensus filters on five datasets that are prone to labeling errors. Our experiments illustrate that filtering significantly improves classification accuracy for noise levels up to 30%. An analytical and empirical evaluation of the precision of our approach shows that consensus filters are conservative at throwing away good data at the expense of retaining bad data and that majority filters are better at detecting bad data at the expense of throwing away good data. This suggests that for situations in which there is a paucity of data, consensus filters are preferable, whereas majority vote filters are preferable for situations with an abundance of data.
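The difference between the majority vote and consensus filters can be illustrated with a minimal sketch. It assumes the filters are applied with n-fold cross-validation, so that each instance is judged only by classifiers that did not see it during training. The base learners here (1-nearest-neighbour, 3-nearest-neighbour, nearest centroid) are toy stand-ins chosen for brevity, not the algorithms used in the paper, and all function names (`knn`, `nearest_centroid`, `filter_noise`) are hypothetical.

```python
from collections import Counter


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def knn(k):
    """Return a learner that fits a k-nearest-neighbour classifier (toy stand-in)."""
    def fit(train_X, train_y):
        def predict(x):
            order = sorted(range(len(train_X)), key=lambda i: sq_dist(train_X[i], x))
            votes = Counter(train_y[i] for i in order[:k])
            return votes.most_common(1)[0][0]
        return predict
    return fit


def nearest_centroid(train_X, train_y):
    """Fit a classifier that predicts the class with the closest mean vector (toy stand-in)."""
    groups = {}
    for x, label in zip(train_X, train_y):
        groups.setdefault(label, []).append(x)
    centroids = {label: [sum(col) / len(pts) for col in zip(*pts)]
                 for label, pts in groups.items()}

    def predict(x):
        return min(centroids, key=lambda label: sq_dist(centroids[label], x))
    return predict


def filter_noise(X, y, learners, n_folds=5, scheme="majority"):
    """Return indices of instances flagged as mislabeled.

    Each instance is classified by models trained on the other folds.
    "majority": flag when more than half of the models disagree with the label.
    "consensus": flag only when every model disagrees with the label.
    """
    n = len(X)
    flagged = []
    for fold in range(n_folds):
        train_idx = [i for i in range(n) if i % n_folds != fold]
        test_idx = [i for i in range(n) if i % n_folds == fold]
        models = [fit([X[i] for i in train_idx], [y[i] for i in train_idx])
                  for fit in learners]
        for i in test_idx:
            errors = sum(model(X[i]) != y[i] for model in models)
            if scheme == "consensus":
                is_noise = errors == len(models)
            else:
                is_noise = errors > len(models) / 2
            if is_noise:
                flagged.append(i)
    return sorted(flagged)


# Two well-separated clusters; the label of instance 2 is flipped on purpose.
X = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
     (10, 10), (10, 11), (11, 10), (11, 11), (10.5, 10.5)]
y = [0] * 5 + [1] * 5
y[2] = 1  # injected label noise

learners = [knn(1), knn(3), nearest_centroid]
majority = filter_noise(X, y, learners, scheme="majority")
consensus = filter_noise(X, y, learners, scheme="consensus")
print("majority filter flags:", majority)
print("consensus filter flags:", consensus)
```

Because the consensus rule requires every classifier to disagree with the given label, the instances it flags are always a subset of those flagged by the majority rule. This matches the trade-off described in the abstract: consensus discards less good data but retains more bad data.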

Similar resources

Kernel Based Detection of Mislabeled Training Examples

The problem of identifying mislabeled training examples has been examined in several studies, with a variety of approaches developed for editing the training data to obtain better classifiers. Many of these approaches involve applying an individual or an ensemble of classifiers to the training set and filtering the mislabeled examples based on their consistency with respect to the classifier’s ...

Improving Automated Land Cover Mapping by Identifying and Eliminating Mislabeled Observations from Training Data

This paper presents a new approach to identifying and eliminating mislabeled training samples. The goal of this technique is to decrease the error of classification algorithms by improving the quality of the training data. The approach employs an ensemble of classifiers that serve as a filter for the training data. Using an n-fold cross validation, the training data is passed through the filter...

Identifying the Mislabeled Training Samples of ECG Signals using Machine Learning

The classification accuracy of electrocardiogram signals is often affected by diverse factors, among which mislabeled training samples are one of the most influential. To mitigate this negative effect, cross validation is introduced to identify the mislabeled samples. The method utilizes the cooperative advantages of different classifiers to act as a filter for t...

Identifying and Eliminating Mislabeled Training Instances

This paper presents a new approach to identifying and eliminating mislabeled training instances. The goal of this technique is to improve classification accuracies produced by learning algorithms by improving the quality of the training data. The approach employs an ensemble of classifiers that serve as a filter for the training data. Using an n-fold cross validation, the training data is passed t...

Boosted Noise Filters for Identifying Mislabeled Data

In many practical classification problems, mislabeled data instances (i.e., class noise) exist in the acquired (training) data and often have a detrimental effect on the classification performance. Identifying such noisy instances and removing them from training data can significantly improve the trained classifiers. One such effective noise detector is the so-called ensemble filter, which pred...

Journal title:
  • J. Artif. Intell. Res.

Volume 11  Issue 

Pages  -

Publication date: 1999